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Abstract 

This paper studies a Markov chain for phylogenetic reconstruction which uses a 
popular transition between tree topologies known as subtree pruning-and-regrafting 
(SPR). We analyze the Markov chain in the simpler setting that the generating tree 
consists of very short edge lengths, short enough so that each sample from the generat- 
ing tree (or character in phylogenetic terminology) is likely to have only one mutation, 
and that there enough samples so that the data looks like the generating distribution. 
We prove in this setting that the Markov chain is rapidly mixing, i.e., it quickly 
converges to its stationary distribution, which is the posterior distribution over tree 
topologies. Our proofs use that the leading term of the maximum likelihood function 
of a tree T is the maximum parsimony score, which is the size of the minimum cut 
in T needed to realize single edge cuts of the generating tree. Our main contribution 
is a combinatorial proof that in our simplified setting, SPR moves are guaranteed to 
converge quickly to the maximum parsimony tree. Our results are in contrast to recent 
works showing examples with heterogeneous data (namely, the data is generated from 
a mixture distribution) where many natural Markov chains are exponentially slow to 
converge to the stationary distribution. 



1 Introduction 

We study Markov Chain Monte Carlo (MCMC) methods for Bayesian inference of phy- 
logeny. We begin by presenting the relevant background material by defining phylogenetic 
trees, evolutionary models (in Section 1.1), and the associated MCMC methods (in Section 
1.2). We refer the interested reader to Semple and Steel [1">] for a more comprehensive 
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introduction to the mathematics of phylogeny. FinaUy, we present our results and discuss 
related work in Section 1.3. 

A phylogenetic tree is an unrooted tree T on n leaves (called taxa, corresponding to 
n species) where internal vertices have degree three. Let E(T) denote the edges of T 
and V{T) denote the vertices. In the phylogenetic reconstruction problem, we observe a 
collection of labelings of the leaves of T from a set 0, and our goal is to infer the tree T 
from which they were generated from. For example, if = {A, C, G, T} then we are given 
(aligned) DNA sequences for n species, and we are trying to determine the tree describing 
the evolutionary history of the present-day species. 

1.1 Evolutionary models and Maximum Likelihood 

The labelings on the leaves of T are the projection of labelings on all vertices of T, and 
these labelings of V are generated in the following manner. There is a stochastic process 
along edges of T (e.g., modeling the evolutionary process of DNA substitutions) which is 
defined by a continuous-time Markov chain. Thus, for each edge e £ T there is a x 
rate matrix Q and a time tg > 0, which is called the branch length of e. In this paper, 
as is typical in the phylogenetic setting, we assume there is a single rate matrix Q that is 
common to all edges. The rate matrix is assumed to be reversible with respect to some 
distribution vr on il. Hence, fix vr as the stationary vector for Q (i.e., ttQ = 0). (The 
matrix Q is usually scaled so that we expect one "substitution" (i.e., change) per unit of 
time.) The rate matrix Q defines a continuous time Markov chain, and together with tg 
defines a transition matrix on edge e: 



The matrix Pg is a stochastic matrix of size \^}\ x and thus defines a discrete-time 
Markov chain, which is time-reversible with stationary distribution vr, i.e., irPe = n, and 
T^i{Pe)ij = TTj{Pe)ji (for aU i,j G i}). 

The simplest four-state (i.e., \^}\ = 4) evolutionary model has a single parameter for 
the off-diagonal entries of the rate matrix Q; this model is known as the Jukes-Cantor 
model. The most general reversible four-state model is the GTR (general time reversible) 
model. For |r2| = 2 (often studied for mathematical interest), the model is binary and the 
rate matrix has a single parameter; this model is known as the CFN (Cavender-Farris- 
Neyman) model. See Felsenstein [t i] or Yang [, :] for an introduction to these evolutionary 
models. 

Given T, the rate matrix Q and the branch lengths t = (te)ee£;(T)) then define the 
following distribution on labelings of the vertices of T. Let Pe = exp{teQ) for e S E{T)- 
We first orient the edges of T away from an arbitrarily chosen root r of the tree. (We 
can choose the root arbitrarily since each Pe is reversible with respect to vr.) Then, the 
probability of a labeling I : V{T) — is 



Pe = exp(teQ) = / + teQ + tlQ"^ l2\ + t^Q^/S! + . . . 



(1) 



(2) 
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The distribution jjlrp q ^ can be generated in an equivalent algorithmic manner. Choose 
i{r) from vr. Then for each edge e = {u,v) E E{T), given an assignment for exactly one 
of the endpoints, say i{u), choose i{v) from the distribution defined by the row of Pg 
corresponding to the label £{u). 

Let lJ'T,Q,t be the marginal distribution of fi'rp gt on the labelings of the leaves of T 
(thus ^J'T,Q,t is a distribution on where n is the number of leaves of T). Fix T* with 
parameters Q* and t* as the generating tree. The goal of phylogeny reconstruction is to 
reconstruct T* (and possibly Q* and t*) from iJ,T*,Q*,t* (more precisely, from independent 
samples from /iT*,Q*,f )• 

Let Q denote a set of rate matrices with non-zero entries where Q* £ Q. The set Q is 
the set of possible rate matrices. The set can be arbitrary, usually it is determined by the 
model considered (e. g., for the Jukes-Cantor model Q would contain rate matrices whose 
off-diagonal entries are the same). One often assumes that the rate matrix Q* is known. 
In this case we would set Q = {Q*}- On the other hand, our results also apply if one sets 
Q to be the set of all rate matrices with non-zero entries. 

We consider the likelihood of a tree T as, the maximum over rate matrices Q G Q 
and over assignments of non-zero branch lengths t to the edges of T, of the probability 
that the tree {T,Q,t) generated fj,. More formally, the maximum expected log-likelihood 
of tree T for distribution fi* is defined by 

CT{^^*) = sup sup£T,Q.t(^*), (3) 
QeQ t 

where 

JC-T,Q,ti^^*) = Yl ^^*iy)H^T,Q,tiy)). (4) 

yen" 

For a set of characters D = {Di, . . . ,Di^) where Di £ Jl", define the log-likelihood of 
a tree T as 

CtCD) = sup sup ln(/XT,Q,t (D)) 
QGS t 

N 

= sup sup Vln(/iT,Q,t(A)) • 

Our goal is to sample from the distribution on the set of phylogenetic trees with n 
leaves where the weight of a tree is Ct(D)- In Section 3 we will look at the straightforward 
extension to the setting where we are given a prior on trees and parameters Q,t, and our 
goal is to sample from the posterior distribution. 

1.2 SPR Markov Chain 

We analyze a Markov chain using transitions made by subtree pruning-and-regrafting 
(SPR). SPR transitions are a natural combinatorial transition, which is also popular in 
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practice. In Section 4 we discuss several other well-studied choices for the transitions. 
Here we consider trees weighted by their maximum likelihood. In Section 3 we discuss 
how the Markov chain definition and our main result extends to sampling the posterior 
distribution. 

An SPR transition from a tree T works by choosing an (internal or terminal) edge 
e = {u,v). If e is an internal edge we consider one of the two subtrees in T \ e, either 
the subtree Su containing u or the subtree Sv containing v. Let Su denote the selected 
subtree. If e is a terminal edge, let Su be the endpoint of e that is a leaf. Let T' denote 
the tree formed by removing Su from T, in particular, we remove Su and edge e from T 
and "smooth away" the vertex v (that is, contract one of the two adjacent edges). We 
then choose an edge e* in T' and we attach S onto e* by adding a new intermediate vertex 
along e*. See Figure 1 for an illustration. Let SPR(T, Su,e*) denote the tree resulting 
from the above transition. 



B C B C 




Figure 1: Illustration of an SPR transition. The randomly chosen edge e is marked by 
an arrow. The subtree containing B and C is pruned, and then regrafted at the edge e* 
marked by a starred arrow. The resulting tree is illustrated. 

We analyze the following Markov chain which chooses a random subtree S to prune, and 
then chooses an edge to regraft S along based on the maximum likelihood of the resulting 
tree. This Markov chain is an analogous to heat bath chains studied in Statistical Physics 
(as opposed to Metropolis chains), e.g., see [J], thus we refer to the below chain as the 
Heat Bath SPR Markov Chain. Here is the formal definition of the transitions Tt — s- Tj+i 
of the Heat Bath SPR Markov Chain. 

From a tree Tt at time t we proceed as follows. 

1. Choose a random subtree S oi Tt, by choosing a random edge e and then choosing 
one of the two subtrees hanging off of e. Let T' denote the tree formed by deleting 
S and e from T. 

2. For each edge e* of tree T' , let w{e*) = £^(D) where f = SPR{T, S, e*) is the tree 
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formed by pruning S from T and regrafting S onto edge e* . Let uj be the distribution 
on edges of T' where u){e*) = w{e*)/Z and Z = J2e'eE{T') w[e'). 

3. Sample an edge e* from the distribution lo on edges of T' . 

4. Graft S onto edge e* and move to this new tree, i. e., set Tt+i = SPR{T, S, e*). 

We now verify that the above Markov chain is ergodic and reversible with respect to 
the distribution ir on trees where 7r(T) oc Ct(D), and thus vr is also the unique stationary 
distribution. Let Ti and T2 be neighboring states of the Markov chain. Let S be the tree 
that is pruned and regrafted to obtain T2 from Ti. Note that the same tree S can be 
pruned and regrafted to obtain Ti from T2 . The transition probability from Ti to T2 is the 
probability of choosing S in step 1 times £t2(D)/Z. Similarly the transition probability 
from T2 to Ti is the probability of choosing S in step 1 times Cti(D)/Z (note that Z is 
the same in both cases since pruning S results in the same tree T'). The detailed balance 
condition is satisfied for vr(T) oc £t(D) and hence it is the unique stationary distribution. 

Let dTvifJ-jt^) denote the (total) variation distance between a pair of probability dis- 
tributions fi and u defined on the same finite, discrete space, and let P^(Tq, •) denote the 
t-step distribution of the Markov chain from initial state Tq. The mixing time T^ix is 
defined as 

Tmix = maxminjt : dTy(-P*(2o, •),7r) < l/2e}. 

To 

which is the time to reach variation distance < l/2e of the stationary distribution from 
the worst initial state. Note, it is straightforward to then "boost" so that for any 5 > 0, 
after rmixlii(l/(5) steps we are within variation distance < 5 of vr, from the worst initial 
state (see Aldous [1]). 

1.3 Results on MCMC for Phylogenetic Reconstruction 

MCMC algorithms are an important tool for phylogenetic reconstruction. MrBayes [10] 
is a popular program that relies on MCMC methods for Bayesian inference of phylogeny. 
MrBayes uses a sophisticated variant of MCMC known as Metropolis- Coupled MCMC [8]. 

For statistical inference problems, such as phylogenetic reconstruction, it is often easy 
to design appropriate MCMC algorithms, such as the above Markov chain we defined using 
SPR transitions, which converge in the limit over time to the desired posterior distribution. 
However, the computational efficiency of these methods rely on their fast convergence to 
the posterior distribution. Since theoretical results are typically lacking, heuristic methods 
are used to measure convergence to the desired distribution. Hence, there are often no 
rigorous guarantees on the scientific computations which rely on the random samples 
produced by the MCMC methods. Our goal is to provide some theoretical understanding of 
settings where MCMC methods for phylogenetic reconstruction are provably fast and hence 
yield accurate results, and settings where the MCMC methods are slow and consequently 
the samples may be misleading. 
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There are several works with computational experiments on the convergence rates 
of MCMC algorithms for phylogenetic reconstruction, e.g., see the recent works [3, II]. 
There is relatively little theoretical work. Diaconis and Holmes [•")] proved fast convergence 
of a Markov chain to the uniform distribution over phylogenetic trees. Recently, several 
works have shown examples of heterogeneous data where MCMC algorithms are provably 
slow to converge. Mossel and Vigoda [13, 11] proved slow convergence for a class of 
examples with data arising from a uniform mixture of a pair of 5-taxa trees (with different 
topologies). Stefankovic and Vigoda [19, 20] proved slow convergence for a class of mixture 
examples from a pair of 5-taxa trees that share the same topology but differ in their branch 
lengths. In these slow mixing results, the convergence time is exponential in the number 
of characters (i.e., sequence length). 

In this paper we show fast convergence for data from a homogenous source of closely 
related species. In particular, for data generated from a single tree (of any size) when 
all the branch lengths are sufficiently short, we prove fast convergence. The requirement 
of sufficiently short branches is for our proof technique, but it is important to note that 
the slow mixing results mentioned earlier [13, 14, 19, 20] require, in an analogous manner, 
sufficiently short branches. If one searches for the tree with the maximum likelihood (or 
maximum a posteriori probability) our methods show that in our setting of very short 
branch lengths, the space of trees (connected by the SPR moves) has no local maxima and 
hence one can find the optimal tree using hill climbing. 

For simplicity, we present our results here for the case where the weight of a tree is the 
maximum likelihood of generating the given data D where the maximum is over a rate 
matrix Q (common to all edges) and a set of branch lengths t. This is closely related to 
the posterior distribution when the priors are uniformly distributed. Our results extend 
to (5-regular priors, which are priors that are lower bounded by some 6 > 0, see Section 3 
for a discussion on the extension of our results to sampling the posterior distribution. We 
are interested in the mixing time Tmixj defined as the number of steps until the chain is 
within variation distance < 1 /2e of the stationary distribution. 

We prove that the Heat Bath SPR Markov Chain converges quickly to its stationary 
distribution when the data is generated from a tree T* where all of the branch lengths are 
sufficiently small, and there are sufficiently many samples generated from T* . Here is the 
formal statement of our main result. 

Theorem 1. Consider any reversible A-state model, any phylogenetic tree T* on n taxa 
and any rate matrix Q* with no zero entries. For all amin > 0, there exists eo > 0, such 
that for all < £ < eo and any choice of branch lengths t* G (amine, e) for e G E(T*), 
there exists Nq > where the following holds. 

For a data set with N > Nq characters, each chosen independently from the distribution 
HT*,Q*,t*, then, with probability > 1 — exp(— -v/^V) over the data generated, the Heat Bath 
SPR Markov Chain has mixing time < 50n. 

Since there is considerable quantification in the above theorem we will take a moment 
to dissect it at a high-level. First off, the requirement that > Nq comes from needing 
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the data to look very much hke the generating distribution n* = /iT*,Q*,f • Therefore, 
how much data we need depends on several quantities, such as the minimum probabihty 
of a configuration arising in /i*, which depends on the minimum branch length and the 
minimum rate in Q* . Hence, Nq depends on amin and Q* , and is exponential in n. A 
somewhat related question has been studied by Steel and Szekely [16, 17, 18] on how large 
A'^ needs to be so that the maximum likelihood tree is the generating tree T* . In their 
results one also needs N to be exponential in n. 

Our proof uses the fact that in the setting of Theorem 1 , where the branch lengths are 
sufficiently short, the leading term of the maximum likelihood function is actually maxi- 
mum parsimony. Such a result is well-known in the mathematical phylogeny community, 
and was first observed by Felsenstein [7]. We require a more detailed statement of such 
a result, which we present in Lemma 2 in Section 2.2. Our main technical contribution 
is a combinatorial proof that, in the setting of Theorem 1, SPR moves can be used in a 
greedy manner to quickly find the maximum parsimony tree. This result is presented in 
Section 2.3. Finally, in Section 2.4 we show how Theorem 1 follows in a straightforward 
manner from these combinatorial results. In Section 3 we discuss how Theorem 1 extends 
to Bayesian inference. We make some concluding remarks in Section 4. 

2 Proof of Rapid Mixing 

2.1 Overview 

To prove Theorem 1 we will analyze, for every tree T, the maximum expected log-likelihood 
£t(/U*) where fj,* = HT*,Q*,t* (recall that CT{^J'*) is the maximum expected log-likelihood 
of T maximized over all rate matrices Q and all edge lengths t, see (3)). To analyze CxifJ-*) 
we will consider the dominant terms of the likelihood function. We will show that 

CrifJ-*) = £{7r*) + A{T)e In e + o{e In e), 

where '^(vr*) is the entropy of the stationary distribution of Q* and thus is the same for 
every T. By taking e sufficiently small the last term o(elne) can be ignored. Therefore, 
the dominant term is A{T)e In e. We will prove that the function A{T) decreases with each 
optimal SPR move. Hence, since Ine is negative, we then have that T* has the highest 
maximum expected log-likelihood, and as the Markov chain gets closer to T* the maximum 
expected log-likelihood will increase. Theorem 1 will then follow in a straightforward 
manner. 

2.2 Analyzing Likelihood 

Let T be a tree with leaves {1, . . . , n}. Let i? be a partition of the leaves into two parts 
(i?i,i?2)- Note that we only consider the partition of the leaves without any regard for 
the internal vertices. Let cutij(T) denote the size of the cut (i.e., a subset of edges) of 
minimum size that disconnects Ri from i?2- Tuffiey and Steel [21] showed that the quantity 
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cutR{T) is the parsimony score of the character corresponding to R (see also Semple and 
Steel [15, Proposition 5.1.6]), where the character corresponding to R = (i?i,i?2) assigns 
all leaves in Ri some a G and assigns all leaves in R2 some /3 G fi, /3 7^ a. 

For an edge e £ T, the removal of e splits T into two components. This induces a 
partition of the leaves of T into two parts. We will call this partition R{T, e). 

The following is the main technical tool for our results. The lemma describes the high- 
order terms of the likelihood function as e — >• 0. Throughout this paper, the asymptotic 
notations o() and 0{) are parameterized by e — ?■ 0. 

Roughly speaking, the lemma shows there is a function A{T) which plays a leading 
role in the maximum likelihood. In words, A{T) looks at each partition R of leaves in the 
generating tree T* realized by cutting a single edge e*. It then considers the minimum 
number of edges in T to realize this partition R times the branch length of e* in the 
generating tree. As mentioned in the introduction, there are earlier results which show 
that the leading term of the likelihood function is the parsimony score, e.g., Felsenstein [ ], 
and in the below lemma, the function A is the leading term of the expected parsimony 
score. We require a more detailed statement than we found in the literature. 

Lemma 2. Let T*,T be trees with leaves {1, . . . ,n} and Q* be a rate matrix reversible 
with respect to vr*. Assume that the matrix Q* is normalized (that is, Yli=/=j '^IQij — ^) ^'^^ 
that Q* has no zero entries. Let T* have branch lengths t* = a%£, where a* G [«min' o^max]; 
for all e G E{T*), where amin > 0. Let 

A = At,,^*{T)= J2 <cutn(T',e){T)- (5) 

e£E{T") 

For fi* = fJ.T*,Q*,t* the following holds: 

CrifJ-*) = f(vr*) + ^elne + 0(elnln(l/e)), (6) 

where <f (vr*) = Y2ien '^i ^^'^i ^-^ the entropy of tt* , and the constant in the O(-) is indepen- 
dent of the choice of the a* (but does depend on a^i^ and aj^^xy'- 

As a consequence of the above lemma, to analyze the expected log-likelihood on the 
tree space, when e is sufficiently small, we simply have to consider the function A = A{T). 
In the next subsection we will investigate the combinatorial properties of A. 

Before presenting the proof of Lemma 2 we give a brief outline of its proof. Let ^* 
denote the probability distribution defined by Q* and T* on assignments of labels from 
fl to the leaves. In this generating distribution fi* , to prove (6) we only need to consider 
two types of assignments. The first type are constant assignments where no substitutions 
occur and thus all leaves receive the same label i £ Q, these are denoted as ai. The second 
type are assignments obtained by a substitution along just one edge e*. In this case, the 
cut obtained by deleting edge e* plays an important role. By deleting e* from T* , the 
leaves are partitioned into two sets Ri and R2, denoted as R{T*,e*). If a substitution 
only occurs along edge e*, then the leaves in Ri will receive the same label i £ 0,, and 
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the leaves in R2 will receive another label j G 0,j 7^ i. We denote such an assignment 
by af* . Any other type of assignment requires at least two substitutions, and hence has 
probability at most O(e^), which is dominated by the 0(e lnln(l/e)) term of (6). 

For any tree T, to prove that £t(/U*) is lower bounded by the right-hand side of (6), 
we compute the expected log-likelihood of /x* for the rate matrix Q = Q* and the set of 
branch lengths t where tg = e for every edge e. For each edge e* and its corresponding 
assignment afj , the quantity cut^(2"*,e)(^) is the minimum number of edges which require 
a substitution to obtain the assignment afj on T. Hence, the quantity A = A{T) plays 
an important role when we sum over all edges e* of T* . In particular, by a calculation 
(as detailed in (13) below), the expected log- likelihood jC,T,Q,t{lJ'*), for this set of branch 
lengths t, is X^ien^i* i'^^j* + Aelne + 0(e). Since 0{e) = 0(elnln(l/e)), this implies the 
lower bound of (6). 

To obtain the upper bound of (6) we consider three cases: when the rate matrix Q has a 
stationary distribution different from Q* , when there is an edge e where tg is long (namely 
> e(ln(l/e))^), and when all edges are short. In the first case of different stationary 
distributions, by considering the constant assignments it will be easy to establish that 
there is a difference in the first term of the right-hand side of (6). When there is a long 
edge, then the constant assignments are too unlikely to occur. Finally, if all edges are 
shorter than e(ln(l/e))^, then, by calculation, we show that the expected log-likelihood is 
at most the right-hand side of (6). 

We now present the formal proof of Lemma 2. 

Proof of Lemma 2. We first make some observations about the distribution /x* . Let P* 
denote the transition matrix for Q* , as defined in (1). 
Note, for any e, any i,j G $7 where i ^ j, we have 

{P:h = Q*,a:e + 0{e^). (7) 

For any i G we have 

{p:)u = i-j2{p:h = 1+0(8). 

For i £ Q, let ai £ denote the constant assignment ai{v) = i for all leaves v. Note 
to achieve cjj in //* we assign label i to the root and then we have no substitutions, or we 
have at least two edges with substitutions. Thus, 

/^*(^.) = < n iP:h + 0{e^) = 7r*+0{e). (8) 

eeE(T*) 

For an edge e G E{T*) and i,j G ^ where i 7^ j, let a^^ G Jl" denote the assignment of 
label i to all leaves in one of the partitions of R{T* , e) and label j to all leaves in the other 
partition of R(T* , e). In this case we have: 

/x*(cTf,) = <Q:,a:e + 0(^'). (9) 
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(To see why (9) is correct, w.l.o.g., assume that the root is a leaf in the first partition of 
R{T* , e), and hence to achieve Cij we need to label the root by i and have a substitution 
on e, or at least two edges with substitutions.) 

Now we compute £r,Q,t(/^*) where tg = e for each edge e of T and Q = Q*. Again we 
will make some observations about fi = /^T,Q,t- By the same reasoning as we used for (8), 
we obtain 

/i(a,) = 7r* + 0(e). (10) 

We can obtain assignment afj on T using a substitution on cut^(T* g)(T) edges, and we 
cannot obtain this assignment with fewer substitutions. Hence, 

Therefore, 

ln/x(4) = G(l) + cutR(r*,e)(r) Ine. (12) 

In order to compute the high-order terms of CT,Q,t{lJ'*) we do not need to consider labelings 
other than and afj (the other labelings have probability O(e^) in fi*). 
Combining (9), (8), (10), and (12) we obtain 

CT,Q,t{f^*) = 0{e^lne) + Y,i< + 0{e))H< + 0{e)) 

= Oie) + ^n*lmT* + Aelne, (13) 

where in the last inequality we used the fact that Q* is normalized. This proves the lower 
bound in (6). 

It remains to prove the upper bound in (6). We will show that no rate matrix and 
no assignment of branch lengths can do better than the bound established in (13). Let 
Q be a rate matrix with stationary distribution vr. If vr 7^ vr* then we bound CT,Q,tifJ-*) 
as follows. First, note that the terms in the sum (4) are negative and hence to obtain an 
upper bound we will only consider the constant assignments. Second, the probability of 
constant assignment ai in /i* is ^Ji*{(Ti) < vr* and similarly /x((Tj) < vr. Thus 

£-T,Q,t{f^*) < ^7r*ln7ri = ^7r*ln7r* - L>i^L(7r*||7r), 

where i^xL (7r*||'/r) := X^jgQ 7r*(ln(7r*/7ri)) is the K-L divergence of tt from vr*. Since, by 
the Gibbs' inequality, the KL-divergence is positive when vr 7^ vr*, we have established the 
upper bound in (6) for the case vr 7^ tt*. 

Now we assume tt = vr*. Let t be an assignment of branch lengths to T. Let fj, = fJ.T,Q,t- 
Suppose that there exists an edge / S E{T) with branch length tj > e(ln(l/e))^. We are 



10 



going to show that such a t has a tiny log-hkehhood because of the constant leaf labehngs 
(i.e., ai,i G n). By (1), we have {Pf)u < 1 - (?mine(ln(l/e))2 + 0{e^ln{l/e))^), where 
Qmin = minijgn |Q(i, j)|- Hence, 

/u(a,) < (1 - g^ine(ln(l/£))2 ^ o(e2(ln(l/£))4)) . 

Thus 

JC-T,Q,t{f^l < 0(e) + 5^<(ln(vr,)-<Zmine(ln(l/e))2 + 0(£2(ln(l/e))4)) 

< £iTT*)-q^i^e{lnil/e)f + Oie). (14) 

As e — )• 0, (14) is smaller than the right-hand side of (6) and we are done. 

We are now left with the case in which all edges / G E{T) have branch lengths 
t/ < e(ln(l/e))^. Since we can generate the leaf labelings starting from any vertex, then 
by starting at a leaf, we see that: 

In /i (cTj )< In VTj. (15) 

Moreover, for e S E{T*), to generate afj, we need to have substitutions across all edges 
in a cut that realizes R{T*,e). Since the edges are short this happens with probability 
< (e(ln(l/e))^)'^ where k is the size of the cut. Since there are at most 2" such cuts and 
each has size at least cutji(^T*,e){T), we have that: 

lnfi{afj) = cutfi(T.,e)(T)(0(lnln(l/e)) + Ine). (16) 

Hence, 

CT,Q,t{f^*) < 0{eHne)+£{7r) 

+ E E«Qii«ee + 0(e2))cut^(r.,e)(T)(0(lnln(l/£)) + lne) 

eGE{T*) i+j 

= 0(elnln(l/e)) +f (vr) + Aelne. 

□ 

2.3 Analyzing the Cut-Distance A(T) 

In light of Lemma 2, we need to analyze how A{T) changes with SPR moves. By taking N 
sufficiently large, for each subtree 5, we will only need to analyze the effect of the optimal 
SPR move for S (optimal in terms of minimizing A{T')). 

The quantity A{T) looks at cuts obtained by single edges of T* . For a tree T, we 
classify the edges of T* as good or bad if their corresponding cut in T* is realizable in T 
by cutting a single edge. More precisely, let 

GOODt.(T) = {e* G E{T*) : there exists e e E{T) where R{T,e) = R{T*,e*)}, 
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be the set of good edges for T. Let BADr*(T) = E{T*) \ GOOBr-iT). 

The following lemma says that for every tree T obtained from T by an SPR move 
using S, if T has more bad edges than T, then this was not the optimal SPR move using 
S. Namely, there is a tree T' which is also obtained from T by an SPR move using S, and 
T' is such that A{T') < A(T). (More precisely, each term in A{T') is less than or equal 
to the corresponding term in A(T) and there is a term in A{T') which is strictly smaller 
than the corresponding term in A{T)). Our proof has some similarity to those of Bruen 
and Bryant [2] which connect the parsimony score of a character to the minimum number 
of SPR transitions needed to obtain the character. 

Lemma 3. For every generating tree T* and all trees T, T where T and T differ by a 
prune- and-regraft of a subtree S and such that there exists f* £ BADy (T) \ BADy (T) 
the following holds. There exists a tree T' which differs from T by a prune- and-regraft of 
S and such that cut/j(T') < cut/j(T) for every partition R realized by single edges in T* 
and cut/j(T') < cut ji(T) for partition R realized by f* in T* . 

Proof. Suppose an edge /* G E{T*) is good for T and is bad for T. Let Li,L2 be the 
partition of the leaves induced by /* in T* . Thus, in T, there is an edge / = (f i, ^2) which 
partitions the leaves into Li and L2. See Figure 2 for an illustration of the setup. 




Figure 2: Edge /* is good for T. 



Let 5i denote the subtree "hanging off" of vi in T. More precisely, after deleting / 
from T, let 5i be the subtree containing vi. Let Li denote the leaves in Si. Similarly, let 
L2 denote the leaves and 5*2 denote the subtree hanging off of V2. Let v denote the root 
of the subtree S. 

First we claim that f ^ S. Suppose f £ S and without loss of generality suppose 
Si C S. See Figure 3 for an illustration of this case. Thus we must be grafting S into an 
edge of ^2 \ S. After such a move, the edge / still separates Li and L2, and thus /* is 
still good. Therefore, f ^ S. 

From now on, we assume, without loss of generality, that S C Si where S ^ Si. see 
Figure 4. We construct the tree T' by taking T, pruning S and then regrafting S along 
edge /, see Figure 5. 

Note that T is obtained from T by regrafting S onto an edge in 5*2 (otherwise /* would 
be good for T), see Figure 6. 
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tree T 




Figure 3: Case when f €z S, this scenario cannot occur. 




Figure 4: Case when f ^ S, this must be the scenario. 



The following claim says that the tree T' satisfies the conclusion of the lemma. 
Claim 4. For every partition R = (Ri, R2) of leaves realized by edges ofT*, it holds that: 

cutR(f ) > cut/j(r'), 

Moreover, for the partition R* = (Li,L2) (corresponding to f*), we have that: 

cut R*{f) > cut/j.(T'). 

The proof of the claim proceeds by constructing a cut in T' realizing (i?i,i?2) by a 
small modification of a cut in T realizing {Ri, -R2). Assuming the claim, the proof of the 
lemma is now complete. □ 

We now prove the above claim. 

Proof of Claim 4- We continue using the setup and notation from the proof of Lemma 3 
in Section 2.3. 

Recall, the claim says that for every partition R = {Ri, R2) of leaves realized by edges 
of T* that cut/j(r) > cutij(T') and for the partition R* = (Li,L2), cuti?.(f') > cutj?.(r'). 
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Figure 5: Construction of the tree T' . 




Figure 6: In tree T, S is regrafted into 82- 



First we argue that cuti?. (T) > cut/?* (T'). Note that cutij. (T) > 2 since /* is bad for 
T. On the other hand, cutR*(r') = 1 since cutting /" separates L\ and L2. Now we just 
need to argue that cut/?(T) > cut/j(T'). 

Let g* G E{T*) be an edge in T* . Let R = (i?i,i?2) be the corresponding partition in 
T* . Note that if g* is in the subtree with leaves Li then 

Ri C Li and R2 2 L2. (17) 

On the other hand, if g* is in the subtree with leaves L2 then 

R2 C L2 and Ri^Li. (18) 

Consider a minimum cut C C E[T) that realizes (i?i,i?2) in and amongst these 
minimum cuts is the one with the fewest number of edges in subtrees Si\S and S. 

We claim that vi is reachable from a leaf oi Si\S viiT\C and that v is reachable from 
a leaf of 5 in r \ C. Suppose that vi is not reachable from a leaf of Si \ S. Let e' be the 
edge in Cr\{Si\S) closest to vi. We claim that C = (C\{e'})U{/} realizes (RuRa)- If 
there was a pair of leaves in 5i \ 5* each in different Ri that are connected in T \ C" then 
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by the choice of e' one of those leaves would be connected to f i in T \ C, a contradiction 
with the assumption that vi is not reachable from a leaf of Si\ S in T \ C. Thus Ri and 
R2 are still separated in 5i \ 5 in T \ C"; Ri and R2 are still separated by C in S'2 U 5 
since C = C" in this subtree; and f £ C ensures that pairs across / are separated. Note 
that \C'\ < \C\ and C has fewer edges in 5i \ 5 and S, a contradiction with the choice of 
C. Thus f 1 is reachable from some leaf of 5i \ 5 in T \ C The argument for S and v is 
the same. 

Since a leaf of S is reachable from v in T\C, then in other words a (non-empty) subset 
of Ri and/or R2 are reachable from v. Moreover, since C realizes the partition (i?i,i?2), 
then a subset of only one of the Ri is reachable from t> in T \ C; we will say v is of type 
Ri to signify the Ri reachable from v. Analogously, we say Vi is of type Ri for the set 
reachable from vi. 

If V and vi are of the same type Ri , let 



We claim C realizes (i?i,i?2) in T' . To see this, note that if a path (between a pair of 
leaves) exists in T' and does not exist in T then it must include w which is the new vertex 
in T' where S is regrafted, see Figure 5 for an illustration. Now we argue that such a path 
can not connect a leaf in Ri with a leaf in R2 in T' \C' . Note that only a subset of Ri is 
reachable from w in T' \ C", since w can reach the same set of vertices (outside of S) in 
T' \ C as vi does in T \ C, and only a subset of Ri is reachable from v va. S\C . Finally, 
since \C'\ = \C\ we have that cutij(T') < cvXr{T) which completes the proof in this case. 

Now suppose vi is of type Ri and v is of type i?2- This means a leaf of S is in R2 
and since 5 C 5i, it is also in Li and we are in case (17), thus R2 2 L2. Note C has 
to separate vi from L2 by some set of edges Q <^ C. Let C = (C \ Q) U {/}. The new 
pairs of leaves that are connected in T' \ C (but not in T \ C) are either both from L2 
and hence R2, or are connected by a path that exists in T' and does not exist in T. As in 
the previous case, if a path (between a pair of leaves) exists in T' and does not exist in T 
then it must include w which is the new vertex in T' where S is regrafted. Note that w is 
disconnected from Si \ S in T' \ C (since / G C). The leaves of ^2 are from R2 and the 
leaves of S reachable from v in S \ C are also from i?2. Therefore the new paths do not 
connect leaves of i?i and R2- This completes this case since \C'\ < \C\. 

Finally, suppose vi is of type R2 and v is of type Ri . In this case a leaf of 5i \ S* is in 
i?2 and is also in Li and therefore we are again in case (17), thus R2 ^ L2- Note C has to 
separate v from L2 by some set of edges Q C C. Let C = {C \ Q) U {w, v}. Once again, 
the new pairs of leaves that are connected in T' \ C (but not in T \ C) are either both 
from L2 and hence R2, or are connected by a path that exists in T' and does not exist in 
T. Note that w is disconnected from the leaves of S in T' \C' . The leaves of ^2 are from 
i?2 and the leaves of 5i \ S reachable from vi are also from R2. Therefore the new paths 
in T' \ C do not connect leaves of Ri and R2. This completes this case since \C'\ < \C\. 




C if / ^ C 

(C\ {/})□{/'} if fee 
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This completes the proof of the claim. 



□ 



Using Lemma 3, we will prove that for every subtree S the optimal SPR move using 
S does not increase the number of bad edges, and there is a subtree S where the optimal 
SPR move using S decreases the number of bad edges. It will then be straightforward to 
prove rapid mixing by analyzing the time until the number of bad edges is zero, and hence 
we have reached T* . 

Lemma 5. For all trees T* , every choice of parameters a : E{T*) — )• M+, for all trees 
T 7^ T* the following holds, where A = At*, a is defined in (5). 

1. For any subtree S of T the following holds. Let T^i^ be any tree which minimizes 
^(^min) amongst the SPR neighbors of T which differ by a prune- and-regraft of S. 
Then, 

BADT.(r^in) CBADT.(r). (19) 

2. There exists a subtree S of T where the following holds. Let Tmm be any tree which 
minimizes j4(Tmin) amongst the SPR neighbors of T which differ by a prune- and- 
regraft of S. Then, 

BADt* (T^in) £ BADt* (T) . (20) 

Part 1 of Lemma 5 follows immediately from Lemma 3. To prove part 2 we choose 
a particular 'minimal' subtree S. Roughly speaking, we consider the bad edge /* that is 
closest to the leaves in T*, and take the subtree S hanging off of /*. 

Proof of Lemma 5. If (19) is violated then there exists /* G BADT*(Tmin) \ BADt.(T), 
and hence by Lemma 3, there exists T' (which differs from T and Tmin by a prune- and- 
regraft of S) such that no cuts increased in size and the cut corresponding to /* is smaller. 
Therefore, A{T') < A{T^i^), contradicting the choice of T^am- Therefore, part 1 holds. 

We now prove part 2. We first claim that there is an SPR move that decreases the 
number of bad edges. 

Claim 6. For every tree T, there is an SPR move resulting in a tree T' where 

BADT-(r') C BADT.(r). (21) 

Now we argue that part 2 of the lemma follows from the above claim and part 1 . We 
then go back to prove the claim. 

Consider a subtree S of T. Let Ns{T) denote those trees obtainable from T by a 
prune-and-regraft of S. Note, for any T' G Ns{T), we have that Ns{T') = Ns{T), since 
when we prune S from T and T' we have the same subtree remaining. 

Let T' denote the neighboring tree from Claim 6 with fewer bad edges, and let S denote 
the subtree where T' £ Ns{T). Let Tmin denote the tree in Ns{T) which minimizes 
A{Tmm)- As noted above we must have that Ns{T') = Ns{T). Thus, Tmin is also the 
neighbor of T' that minimizes A(T^[^). Therefore, we can apply part 1 of the lemma for 
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tree T' and subtree S, and conclude that BADy(Tmin) ^ BADr*(T'). Combined with 
(21) we then have that: 

BADT*(T^in) £BADr.(T), 
which proves part 2. □ 

We now prove Claim 6. 

Proof of Claim 6. Let /* in T* be an edge in BAD-p* (T) that is "closest" to the leaves in 
the following precise sense. Say /* joins subtrees S* and Z* in T* where the number of 
vertices in S* is at most the number of vertices in Z* . Then we say the distance of /* to 
the leaves is the number of vertices of S* . 

Note, by the choice of /*, S* contains no bad edges for T. First, note that S* must 
contain at least two leaves because, in any tree, any single leaf can be separated from the 
rest of the leaves by deleting one edge (which would contradict that /* is bad). Let S* 
and S2 denote the two subtrees of S* hanging from the root of S* in T* . Both S* and 
must exist since S* contains at least two leaves. 

Let Li and L2 denote the leaves in and S2 , respectively. Since /* is the closest bad 
edge to the leaves, there is a subtree Si in T whose leaves are Li, and also a subtree 52 
whose leaves are L2. Moreover, by induction. Si = S^ and ^2 = 5*2. In T, by pruning 52 
and then regrafting along the edge incident to Si we obtain a copy of S* in T. See Figure 
7 for an illustration. Let T' be the tree resulting from this SPR move. Note, /* is now a 
good edge in T' . 




Figure 7: Construction of the tree T' with fewer bad edges. 



It remains to argue that other edges of T* did not change from good for T to bad for 
T' . Note, edges in S^ and 52 remain good in T' since they are realizable in Si and 52, 
respectively. Consider an edge e* of T* where e* ^ 5*. Let (i?i,i?2) be the partition of 
the leaves realized by e* in T* . Note that Li, L2 are in the same partition, since tree 5* is 
not cut by e*. Let g be the edge in T that realizes {Ri, R2)- After pruning-and-regrafting 
52 (to form T'), g still realizes the partition (i?i,i?2) since Li and L2 are in the same 
partition. Hence, e* is stih good for T' . Therefore, GOODT*(r') D GOODT.(r) U {/*}, 
which completes the proof of the claim. □ 
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Finally, we prove that when the number of bad edges increases then A{T) also increases 
by a significant amount. As a consequence, in our analysis of the Markov chain, by taking 
e sufficiently small, we can focus on how a transition changes A{T) and hence on the 
change in the number of bad edges. 

Lemma 7. For any trees T and T' which differ by one SPR move, if jBAD^. (T')| > 
|BADT-(r)| then 

A{T') > ^(T^in) + 

Proof. Let S be the subtree used to move between T and T' . Let Ns{T) denote those 
trees obtainable from T by a prune-and-regraft of S. Note Ns{T) = Ns{T'). 

Consider T^[^ which minimizes A{T^[^) amongst the SPR neighbors of T which differ 
by a prune-and-regraft of S. Since Ns{T') = Ns(T') then Tmm is also the neighbor of T' 
that minimizes ^(Tmin)- Fix e G BADT*(r') \ BADT*(r). By Part 1 of Lemma 5, 

BADt- (Tmin) C BADt- (T) n BADt- (T'). 

□ 

2.4 Proof of Rapid Mixing: Theorem 1 

The proof of our main theorem now follows from a straightforward argument. We show 
that the Heat Bath SPR Markov Chain behaves like a local search algorithm, and then a 
simple coupling argument gives the mixing result. 

Proof of Theorem 1. Let T denote the space of phylogenetic trees on n taxa. For a tree 
T £ T, and subtree S of T, let Ns{T) denote those trees obtainable from T by pruning- 
and-regrafting S. 

Let C be the constant in the O(-) notation of (6) for the chosen amin and amax = 1- 
By choosing eq (note that Eq is an upper bound on e) sufficiently small then for every 
tree T, in (6), the Celnln(l/e) is smaller than |amin(elne)/10| and therefore: 

CTif^*)=S{7r*) + {A{T) + 6T)eln£, (22) 

for some |(5t| < amin/ 10. 

Fix a tree T ^ T* and a subtree S of T. By Lemma 7 and (22), for every T' £ Ns{T) 
where |BADr*(T')| > |BADt*(T)| we have: 

CT'{^^*) < ^T„,i„(/"*) - (9/10)a^ineln(l/e). 

For a character a G 1^", let D{a) = \{i : Di = a}\. A straightforward application of 
Hoeffding's inequality [ ] and a union bound over o" G fi" implies, for all > 0: 

Pr (for all a G 17", \D{a) - ^J*{a)N\ <5N)> 1-2- 4" exp(-252Ar). 
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Let gmin = niinjjen:i^j Qi^j denote a lower-bound on the off-diagonal entries in the rate 
matrix. For eq sufficiently small, every labeling of the leaves has probability at least e^""; 
this follows from the fact that for every edge, every transition has probability see 
(7) for a precise statement. Hence, by choosing eq sufficiently small (relative to Ominj^min 
and the constant in the error term of (7)), then for all a E il*^, > e^". Let 



for sufficiently large. Therefore, with probability > 1 — 4nexp(— lOn), the chain will 
move from T to some ^min (wliere ^min 

is a tree that can be obtained from T by an 
SPR move and such that it minimizes A(rniin)), and thus by part 1 of Lemma 5 the 
number of bad edges will not increase. Moreover, if we choose the subtree S satisfying 
part 2 of Lemma 5 then the number of bad edges will decrease. Hence, with probability 
> l/(4n) — 4nexp(— lOn) > l/(5n) the number of bad edges decreases by at least one. In 
expectation, after < 5n steps of the chain the number of bad edges will be zero, in which 
case we have reached T* . By Markov's inequality, with probability > 9/10 after 50n steps 
we reach T* . Once we reach T* the probability of moving to a different tree within 50n 
steps is at most 50n(4n)^ exp(— lOn) < 1/100. Hence the claimed mixing time follows by 
an elementary coupling argument (c.f., [12] for an introduction to the coupling technique) 
since from any pair of initial trees, both chains (run independently) reach T* at time 50n 
with probability > 1 — l/2e. □ 

3 Bayesian Inference 

The goal is often to randomly sample from the posterior distribution over trees. To do 
this, we consider a Markov chain whose stationary distribution is the posterior distribution 
and analyze the chain's mixing time, which is a measure of the convergence time of the 
chain to its stationary distribution. Let <I>(T, (5,t) denote a prior density where 



Our results extend to priors that are lower bounded by some 5 > as in Mossel and 
Vigoda [14]. In particular, for all trees T and all branch lengths t where tg < to for all 
edges e, we require <I>(T, (5*,t) > 5. We refer to these priors as (5, to)-i"egular priors. 



5 = Qmmeln(l/e)/(20 • 4"nlne). 



Then, for D ~ 
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Applying Bayes law we get the posterior distribution: 



Pr(r,g,t I D) 



flT,Q,t{^MT,Q,t 

Pr(D) 



where 



^ jQ'eQ Jt' 



f,T',Q',t'i^MT',Q',t' 



)dt'dQ. 



Each tree T then has a posterior weight 



w{T)= f f fiT,Q,t{^MT,Q,t)dtdQ 



jQeQ Jt 



(24) 



Finally, the posterior distribution ^ on trees is defined as /u(T) = w{T)/ J2t' ^(^')- 

3.1 Extension of Theorem 1 to Sampling the Posterior 

To sample from the posterior distribution, the Markov chain is defined as in Section 1.2 
except that in step 2 the weight w{e*) is now set as w{T*) defined in (24). This ensures 
that the Markov chain is reversible with respect to the posterior distribution, and hence 
this is the unique stationary distribution. 

Theorem 1 then extends to hold for any priors which are {5, 2eo)-i'egular The proof 
easily extends to this case in the following manner. 

In particular, we need to modify the statement of Lemma 2 so that, for any tree T, (6) 
is achieved for Q = Q* and for every set of branch lengths t where te G (e/2,2e) for all 
edges e. Then we can use the same proof as Lemma 21 in Mossel and Vigoda [14] to get 
an analog of (23) to hold for the posterior weights defined in (24) in place of the maximum 
likelihood function exp(£(D)), and the remainder of the proof of Theorem 1 remains the 
same. 

4 Discussion 

NNI Transitions: In a NNI transition, an internal edge e is chosen. Since internal 
vertices have degree three, there are four subtrees hanging off of e. There are three 
possible ways of attaching these four subtrees to e, and an NNI transition moves to one of 
these rearrangements. There are trees T (different from the generating tree T*) where no 
NNI neighbor (strictly) improves A{T); moreover, there are cases where there is also no 
improvement in the next term of (6). We are uncertain as to whether Theorem 1 holds 
for a Markov chain based on NNI transitions. It would be especially intriguing if there 
are cases where chains based on NNI transitions are slow to converge (so-called torpidly 
mixing), whereas a chain based on SPR transitions is provably fast to converge (rapidly 
mixing) . 



20 



Possible Future Work: There are now several works with proofs of convergence of 
MCMC algorithms for phylogenetic reconstruction in certain settings - rapid mixing re- 
sults in this paper and torpid mixing results in Mossel and Vigoda [13, 14] and Stefankovic 
and Vigoda [19, 20]). All of these results require that the branch lengths are sufficiently 
small so that only the dominant terms of the likelihood function need to be considered. 
A natural avenue for extending this paper, is to allow arbitrary branch lengths on the 
terminal edges. 

Rapid or Torpid Mixing for General Pure Distributions: The most tantalizing 
question to the authors is whether there exists a pure distribution (i. e., a single generating 
tree as in the setting of this paper) where Markov chains based on all of the natural 
transitions (e.g., NNI, SPR and TBR transitions) are slow to converge to the stationary 
distribution (in other words, they are torpidly mixing). We expect simulations can be quite 
useful for finding such a bad example, if one exists; in fact, our previous work [19, 20] on 
this topic was inspired by some intriguing findings from some simple simulations. 
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